Tag
19 articles
Learn to build a basic voice-controlled assistant app that recognizes spoken commands and responds with text-to-speech output, demonstrating the core technology behind modern voice assistants.
Learn to build a basic AI voice assistant that can understand spoken questions and respond with intelligent answers using Python and OpenAI's API.
Learn how voice AI works and why it's particularly challenging in India's diverse linguistic environment. Discover how companies like Wispr Flow are working to make voice technology more accessible.
Learn to build a basic speech-to-speech conversational AI system that processes voice input, generates intelligent responses, and speaks back to users.
This explainer explores the advanced AI technologies behind modern dictation apps, including transformer architectures, real-time processing, and multimodal learning techniques.
IBM has launched two new Granite Speech 4.1 2B models: one autoregressive for high-accuracy speech recognition with translation, and one non-autoregressive for fast inference.
OpenMOSS has released MOSS-Audio, an open-source foundation model that unifies speech, sound, music, and temporal audio reasoning, outperforming existing open-source models, including systems more than four times its size.
This article explains how the Deepgram Python SDK enables developers to integrate advanced voice AI capabilities like transcription, text-to-speech, and asynchronous audio processing into Python applications.
Learn to build a basic voice translation application using DeepL's API and Python. This beginner-friendly tutorial teaches you how to capture voice input, translate it in real time, and speak the results aloud.
Learn what Microsoft VibeVoice is, how it uses AI to understand and generate human speech, and why it's important for the future of voice technology.
Learn to build an offline speech-to-text application using Google's Gemma AI models with real-time audio capture and local inference capabilities.
Learn how to build a system that processes audio and video inputs to generate code, simulating the capabilities of multimodal AI models like Qwen3.5-Omni.